A Handbook of Test Construction (Psychology Revivals) by Paul Kline

Author: Paul Kline
Language: eng
Format: epub
ISBN: 978-1-317-44459-6
Publisher: Taylor & Francis Ltd


The meaning of true scores

In chapter 1 great care was taken to define the meaning of true scores – the scores on the infinite universe of items – for effectively this is the critical point at issue.

Since I shall argue here that the importance of internal-consistency reliability has been exaggerated in psychometry (i.e. I agree with Cattell), and that it can be antithetical to validity, it is essential to state that I fully accept the statistical arguments previously advanced. However, what is not brought out in the mathematical treatment (and this is why true scores are the critical issue) is the psychological significance of true scores as theoretically defined. Examples will best clarify the viewpoint.

Suppose that we are trying to measure a variable such as verbal ability. It is highly likely that the items which appear to tap verbal ability do in fact do so; for example, vocabulary, definitions, synonyms, antonyms, construction of artificial languages with grammars, précis, comprehension and summarization. This is to say that verbal ability is a relatively homogeneous set of skills, clearly defined and bounded. It would be highly surprising if subjects good at précis were not also good at comprehension, or had poor vocabularies. This means that there is good psychological reason to expect that a proper sample of items would be internally consistent, homogeneous and reliable, and that any items that could not be thus described were, in all probability, measuring a variable other than verbal ability. In this case, therefore, the fallible test would be expected to be highly reliable because the universe of true items was itself homogeneous. Indeed, most good tests of ability do have high alpha coefficients because in the sphere of abilities each factor is generally distinct and discrete. If a test is valid – that is, if its items are from the universe of items which we intend – in the ability sphere high reliability is probably a sine qua non.

However, this example also gives a clue to the argument against too high reliability, namely that high reliability can be antithetical to high validity. Let us suppose that our test of verbal ability consists of antonyms, synonyms, comprehension, vocabulary and précis questions. Such measures, when well constructed, have high reliabilities of around 0.90. If, in the quest for still higher reliability, we were to use only one item type, say antonyms, this reliability could indubitably be raised. Yet it is clear, I hope, to most readers that this latter test is highly unlikely to be a more valid test of verbal ability.

In terms of the classical-error model, we can clearly see why this test of higher reliability is less valid. The high reliability of the antonyms test reflects the fact that our sample of test items (antonyms) correlates closely with the hypothetical universe of items, that is, all possible antonyms. However, this true score reflects not verbal ability but ability at antonyms. Thus, by limiting our items and constricting the universe of items, reliable tests can be made, but only at the expense of validity.
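The point can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not a procedure taken from the text: each simulated subject has a general verbal ability, each of the five item types taps a skill that shares an assumed proportion of its variance (SHARED) with that ability, and each item shares an assumed proportion of its variance (LOADING) with its skill. These values, the 20-item test length and the sample of 2,000 subjects are all hypothetical, chosen only so that the mixed test has an alpha in the region of 0.90.

```python
import random
import statistics

ITEM_TYPES = ["antonyms", "synonyms", "comprehension", "vocabulary", "precis"]
SHARED = 0.6   # assumption: share of each skill's variance due to general verbal ability
LOADING = 0.4  # assumption: share of each item's variance due to the skill it taps


def cronbach_alpha(scores):
    """Coefficient alpha for a list of per-person lists of item scores."""
    k = len(scores[0])
    item_vars = [statistics.variance([row[i] for row in scores]) for i in range(k)]
    total_var = statistics.variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_score(skill, rng):
    """One item: part skill, part random error, with unit variance overall."""
    return LOADING ** 0.5 * skill + (1 - LOADING) ** 0.5 * rng.gauss(0, 1)


def simulate(n_people=2000, n_items=20, seed=0):
    """Return abilities plus item scores on a mixed and an antonyms-only test."""
    rng = random.Random(seed)
    ability, mixed, antonyms_only = [], [], []
    for _ in range(n_people):
        g = rng.gauss(0, 1)  # general verbal ability of this subject
        skill = {
            t: SHARED ** 0.5 * g + (1 - SHARED) ** 0.5 * rng.gauss(0, 1)
            for t in ITEM_TYPES
        }
        ability.append(g)
        # Mixed test: item types rotate, so each type supplies n_items / 5 items.
        mixed.append(
            [item_score(skill[ITEM_TYPES[i % len(ITEM_TYPES)]], rng) for i in range(n_items)]
        )
        # Narrow test: every item is an antonyms item.
        antonyms_only.append([item_score(skill["antonyms"], rng) for _ in range(n_items)])
    return ability, mixed, antonyms_only


ability, mixed, antonyms_only = simulate()
for name, scores in [("mixed", mixed), ("antonyms only", antonyms_only)]:
    totals = [sum(row) for row in scores]
    print(name, "alpha:", round(cronbach_alpha(scores), 2),
          "r with verbal ability:", round(pearson(totals, ability), 2))
```

Under these assumptions the antonyms-only test should show the higher alpha but the lower correlation with general verbal ability, since its total score tracks the narrower antonyms skill. The size of the gap depends entirely on the assumed values of SHARED and LOADING; as SHARED approaches 1 the item types become interchangeable and the loss of validity disappears.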


